{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CVjHcUsRyM7I"
   },
   "source": [
    "# Contextual RAG"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WjJnK8VHx4r1"
   },
   "source": [
    "In this notebook, we'll explore Contextual Retrieval, a technique that improves the accuracy of vector search by giving each chunk of a document additional context. Both the full document and the chunk are passed to an LLM, which is asked to write a succinct context that situates the chunk within the document.\n",
    "\n",
    "This combats the lost-context problem that occurs in chunking: for example, if a text is split into sentences, later sentences lose the context established by earlier ones."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pvGgyVYb7JNs"
   },
   "source": [
    "The idea is to do the following:\n",
    "1. Split each document into chunks (nothing new, just like vanilla RAG).\n",
    "2. For each chunk, ask an LLM to write a context that situates the chunk within its document (this is the new part).\n",
    "3. Append that context to the original chunk.\n",
    "4. Build BM25 and vector indexes over the contextualized chunks for hybrid search (if hybrid search is new to you, see the LanceDB blog post on the topic).\n",
    "5. Search as usual!"
   ]
  },
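  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Steps 2 and 3 of the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used later in this notebook: the helper names `build_context_prompt` and `contextualize_chunk`, the prompt wording, and the `fake_llm` stand-in are all hypothetical. A real pipeline would swap `fake_llm` for an actual LLM call.\n",
    "\n",
    "```python\n",
    "# Illustrative sketch only: every name here is hypothetical.\n",
    "from typing import Callable\n",
    "\n",
    "\n",
    "def build_context_prompt(full_doc: str, chunk: str) -> str:\n",
    "    # Show the LLM the whole document plus one chunk and ask for a short situating context.\n",
    "    return (\n",
    "        f\"<document>\\n{full_doc}\\n</document>\\n\\n\"\n",
    "        f\"<chunk>\\n{chunk}\\n</chunk>\\n\\n\"\n",
    "        \"Write a short, succinct context that situates this chunk within the \"\n",
    "        \"document. Answer with the context only.\"\n",
    "    )\n",
    "\n",
    "\n",
    "def contextualize_chunk(full_doc: str, chunk: str, llm: Callable[[str], str]) -> str:\n",
    "    # Step 2: ask the LLM for a context. Step 3: append it to the original chunk.\n",
    "    context = llm(build_context_prompt(full_doc, chunk)).strip()\n",
    "    return f\"{chunk}\\n\\n{context}\"  # the blank-line separator is an assumption\n",
    "\n",
    "\n",
    "def fake_llm(prompt: str) -> str:\n",
    "    # Stand-in for a real LLM call so the sketch runs without an API key.\n",
    "    return \"This chunk sets up the event manager in a fuzzer's main function.\"\n",
    "\n",
    "\n",
    "doc = \"pub fn main() { let mut mgr = SimpleEventManager::new(mon); /* ... */ }\"\n",
    "chunk = \"let mut mgr = SimpleEventManager::new(mon);\"\n",
    "print(contextualize_chunk(doc, chunk, fake_llm))\n",
    "```\n",
    "\n",
    "Passing the LLM in as a plain callable keeps the sketch runnable offline and makes the text-combining logic easy to check on its own."
   ]
  },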
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dECTS1wWNw_s"
   },
   "source": [
    "**Switch the runtime to GPU before running this notebook.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rmiiI22M4aPK"
   },
   "source": [
    "## Install Dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "aGP9H97ghb9-",
    "outputId": "eb83a194-9abb-4edb-dd90-0765404ee35f"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
     ]
    }
   ],
   "source": [
    "# Install\n",
    "!pip install -U openai lancedb einops sentence-transformers transformers datasets tantivy rerankers -qq"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "xrd5JEr2K83M",
    "outputId": "0c40a5bc-1c45-46ba-9a39-f2824464904e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2024-10-07 09:03:31--  https://raw.githubusercontent.com/anthropics/anthropic-cookbook/refs/heads/main/skills/contextual-embeddings/data/codebase_chunks.json\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 1126046 (1.1M) [text/plain]\n",
      "Saving to: ‘./data/codebase_chunks.json’\n",
      "\n",
      "codebase_chunks.jso 100%[===================>]   1.07M  --.-KB/s    in 0.03s   \n",
      "\n",
      "2024-10-07 09:03:32 (41.2 MB/s) - ‘./data/codebase_chunks.json’ saved [1126046/1126046]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Get the data\n",
    "!wget -P ./data/ https://raw.githubusercontent.com/anthropics/anthropic-cookbook/refs/heads/main/skills/contextual-embeddings/data/codebase_chunks.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ygbayCQH6tlr"
   },
   "source": [
    "### Set the OpenAI API key as an environment variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "id": "hOP0kua_q2lp"
   },
   "outputs": [],
   "source": [
    "# Imports\n",
    "\n",
    "import os, re, random, json\n",
    "import pandas as pd\n",
    "from datasets import load_dataset\n",
    "import torch\n",
    "import gc\n",
    "import lancedb\n",
    "import openai\n",
    "from lancedb.embeddings import get_registry\n",
    "from lancedb.pydantic import LanceModel, Vector\n",
    "from tqdm.auto import tqdm\n",
    "from openai import OpenAI\n",
    "\n",
    "pd.set_option(\"max_colwidth\", 400)\n",
    "\n",
    "OAI_KEY = \"sk-proj-....\"  # Replace with your OpenAI Key\n",
    "os.environ[\"OPENAI_API_KEY\"] = OAI_KEY\n",
    "\n",
    "gpt_client = OpenAI(api_key=OAI_KEY)  # For context text generation\n",
    "\n",
    "model = (\n",
    "    get_registry()\n",
    "    .get(\"sentence-transformers\")\n",
    "    .create(name=\"BAAI/bge-small-en-v1.5\", device=\"cuda\")\n",
    ")  # For embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BcX8B04yCp7x"
   },
   "source": [
    "## Data Loading and Chunking"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 67,
     "referenced_widgets": [
      "9bf24e12edff4979b90d5e434e0ca0f8",
      "29263763581840eeaf88d7676eb9331c",
      "4a86eb7a7e4e420d8a56c2a99fb1ad52",
      "38dd263aeb734848b7d43924cf9919e7",
      "9fae9cb7605e4fbba00466ba522132d8",
      "d19b4601f5204865ad90fcea5720620d",
      "324958d0d7fe41979a4056c00d65e380",
      "4426292e98be43dfa09239f5b64899b1",
      "e074acb753424b6aa4f125a62b7bfb12",
      "f7f771f6e45843b99264ca47123314e7",
      "1d873893430248bd9b989f5c6e2baed3"
     ]
    },
    "id": "GNdKH6GuOuo3",
    "outputId": "aafbcbbf-e465-4eda-d874-0047b11376a6"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Debugging Mode: Using few doc samples only \n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9bf24e12edff4979b90d5e434e0ca0f8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Processing 29 chunks from 5 docs:   0%|          | 0/5 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "def load_raw_data(datapath=\"./data/codebase_chunks.json\", debugging=False):\n",
    "    with open(datapath, \"r\") as f:\n",
    "        dataset = json.load(f)\n",
    "    if debugging:\n",
    "        print(\"Debugging Mode: Using few doc samples only \")\n",
    "        dataset = dataset[:5]  # use only a small sample\n",
    "\n",
    "    data = []\n",
    "    num_docs = len(dataset)\n",
    "    total_chunks = sum(len(doc[\"chunks\"]) for doc in dataset)\n",
    "\n",
    "    with tqdm(\n",
    "        total=num_docs,\n",
    "        desc=f\"Processing {total_chunks} chunks from {num_docs} docs\",\n",
    "    ) as pbar:\n",
    "        for doc in dataset:  # Full document\n",
    "            for chunk in doc[\"chunks\"]:  # Each document has multiple chunks\n",
    "                data.append(\n",
    "                    {\n",
    "                        # For contextual RAG this is not embedded as-is: a context is first generated from the chunk and full_doc\n",
    "                        \"raw_chunk\": chunk[\"content\"],\n",
    "                        # Needed to generate context; ideally not stored in the DB, since it inflates the table size\n",
    "                        \"full_doc\": doc[\"content\"],\n",
    "                        \"doc_id\": doc[\"doc_id\"],\n",
    "                        \"original_uuid\": doc[\"original_uuid\"],\n",
    "                        \"chunk_id\": chunk[\"chunk_id\"],\n",
    "                        \"original_index\": chunk[\"original_index\"],\n",
    "                    }\n",
    "                )\n",
    "            pbar.update(1)  # one tick per document, matching total=num_docs\n",
    "\n",
    "    return data\n",
    "\n",
    "\n",
    "# For debugging and tutorial purposes, use only the first few documents\n",
    "raw_chunks = load_raw_data(debugging=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "J9ZADS66CuK9"
   },
   "source": [
    "## Vanilla RAG"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 478,
     "referenced_widgets": [
      "e727f44f93ab497db18b9c5dc40ec2b6",
      "7e405112ebdc4fc1bbddce75e0b602c6",
      "9047c67dec114c81b74b5b1dd2605d17",
      "f13c70fc727a4d098d6683e6e587768b",
      "364f75266ab94f569c7cf9cabdaa64d2",
      "7cec131528c542759877f133cf493395",
      "66ef89da52f24e3eba1b2c6a13922d49",
      "265c059525974be99d8888cbf6fdd580",
      "ef10e2fab40148c1b8fb49157efd593b",
      "1f3ea758dc0749619e89b1881b0322a8",
      "df74287d61374899bcd1b45278294878",
      "2c2b5989a41f4773b49ea6a9a4112eaf",
      "188af7fdbf4046c291be0bf2b91616f5",
      "db515d8d6a8a46bba355ca07c5b51145",
      "97cbdac0e2954d7d9f41a4b8b5621155",
      "62be337e05f74a9da10661f92682b3de",
      "f5539f0dc6854aac858c46508be5fef9",
      "802158705ad64e91b127046e6faa353c",
      "1c97488589394ad0a1cdd05f4c9a29c5",
      "ccebf8f656a54fb6932d9b3db4c71b61",
      "08fa8c12c7b1462fa84f7bb0712720c5",
      "66768ecdacee49d8be8a5b905b8932f5",
      "c43e71cdba594e72bae3598d1ca68afe",
      "36631b7669114c29af38ce2618e04ca1",
      "ace61014a4fc41afb8a1203c6760cf8e",
      "457867c908554f42ac6cf36bca8a437a",
      "c8dddc4bb80844b7a76d052260e10537",
      "8739d67933bd426f899873d160ff8985",
      "86c7d2070b974676a3cc5e890488fe08",
      "fe185c8b0c47437ca20d7d6dd9e3444c",
      "6874e24d73194505a9bd15ef495fd654",
      "5fa2ae635af0454e9cff624e5eb3d82d",
      "514f33ceebf34cff84b8338592f150ec",
      "8d902adb9d834935a052f179b3b68f3c",
      "ba0271b849964f12a638a4750d6435df",
      "edda96531cc74a5ca8b8f6b7a5361a61",
      "b88d0d5089cb49c389b419390e85059e",
      "69408f465cec44f2807e3897f1964fff",
      "1de439bc47714f50a65ae5caa87c6f9e",
      "3d8a354a84c44819a6d4b8252e0b06a1",
      "c7340c6286ba43b09cb76b8addcbd44d",
      "f3e302ec22904880bca12dc8a45cec49",
      "e1c63aba82d746b8b2118f3cba77dc8f",
      "31e3d0782f7a4eb09df9c788b5b179ae",
      "d1e6f550b0bf4b1ab7dbbbcd907307c9",
      "dd9e37313f204c1bb1c7187b92d57f55",
      "9e61abba44d64d2c9b0d098c601f1678",
      "646ff3b9921444dfa85d01d8a59e450c",
      "55975bead127475eaaa09daf04ee8010",
      "8b5780fe638b4fd1adbd48692b89ac51",
      "5e842097c1f94330a41b9dfc36ecb5a4",
      "a9e373f2df644d9fa5196f9221483489",
      "d068011367cd4faeb88b4e083e2ede0b",
      "40211ba9bc6747e88a67a80673746906",
      "68a98fbfc5fb4327b59e51e328203c23",
      "839a3fad85b64888a05795a96328b89e",
      "07b59a3407cb4e96a082cef88aa9df8a",
      "6a1e1af3113e4e768908301125c4665b",
      "14f4b076951a48da8c471ec3f9b75bd5",
      "099a012a548f4bfd910dd5e2f312f1b8",
      "65ad654f4b5c4e059427ac91646ab5f7",
      "47b5188f745f491d9eb7ce267920cdcb",
      "0e6e2482f1834b169ffe0c6993d7ccff",
      "dae2036208784da9be2e93a01c98e92c",
      "2464750c8efc4abfb65c50c0bbb46102",
      "a2c48500ef944973a11609211aee69fc",
      "9a25c2ef8c7743569b099792b6a45abb",
      "f36dbc0f72fc40bc97af261f659faa2a",
      "0a2cb841052d42b18a6fe87fb8ef04c7",
      "2372b93db8ef4c7894caa8c748e89605",
      "cb1189ac838641d0b2b5bfa46a9ac089",
      "c48a8a3519f246179d2e0df4db14320c",
      "a2e9729f33154457a799c2617f70e096",
      "dd528f0ad75a404882a9a7849d430d32",
      "3c40555f33f24f3f8d1b6c5980219e9a",
      "8508cc1b72af4a3c995d04b673794f66",
      "b68331dae8a44102a0edd0649fd0134d",
      "dfc8130e4ef047229f7bf6d9fcf20540",
      "55c19cc546384f339e0df70c019bb186",
      "378dc2334e9f479bb52306d1348f3efe",
      "bae5bdf64c3c43d7a15cff61997e5d57",
      "9bb62f03eb5d4f429f57e6fcfd84828b",
      "d46bfc05eb554bc59e2d177b8eb15561",
      "3a5e6a93919b434bafe07921d0b9f980",
      "d1a98680d6424ef99fa785cce8b093b4",
      "8d9a90dafb4849288dedcf3d18129a01",
      "a57c0219b7174c69a8faeb817c92ecda",
      "9b682a4b36e0438c9a12fe94c77f32f9",
      "bf7645e0c83a4fe0a41c5cfb667f8267",
      "1a57ea70b41941dd91e5405f889e0d81",
      "c2787c62fd7d4cca821b1001bbdebbfc",
      "d427d718d053494699e38e9762aa2212",
      "2cd3295a94a345abba0eb83e8abba2c2",
      "9d342afc5ecd498cbee4cf1981ce96d5",
      "43ff192aafb24006877aef2740ebe860",
      "8a7a8be83a0c4c7498bc980f9c1ead51",
      "8cf4519159d0436f86b1c796622785f0",
      "e15f4132aff24581ada31faf946fb617",
      "90f0b0d8465548149b2f1c57d6f011d2",
      "e94a5f6f6db943ec82b204819ce3ab36",
      "5f39a0ca25744408a40469e474a1a0f9",
      "3916cfd13bd84f10a76c5cdff7eab635",
      "677d473c02b14c68a110cb9f0b68526c",
      "03fc5537535547349f260949355f8da9",
      "100ffe9665dd4d33a3895b84d816bdbf",
      "f8dda518bde1407196e08c4fade81983",
      "bf84f45e332f48bab8a3276bff8463b8",
      "0381917ef22e4e5785c199cd847ccefa",
      "95d2fdbacbcd4722872237cf9aad3a03",
      "1e119fd538c942d49a289a3051dede00",
      "ad82f13e097a43f2bc2d4c9a5f72d6e5",
      "f88437c2a9f042ff83797044b97ae39c",
      "88157f6add15444188dc3385ec57ec24",
      "6257e19e7d5f4ad9a43e39b593243c0c",
      "06b52779e16c41c49234188bc6b354a9",
      "4d92f14e437a452e84bf387d52a326e0",
      "977beddf30214fa5ac0aaf37459ad85d",
      "cfb028f252e34f8c9c4703e0e84a3b43",
      "1d6ee260c6d149f7ac8ca856f0e266ea",
      "d21ccfc8046c46b5a451ad7320c8313c",
      "0e400c7d939c43099eac09bb08746552"
     ]
    },
    "id": "n_O8DxuqGPKx",
    "outputId": "48970bdf-fd5c-4295-83a6-4d8c8cfe0d4b"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n",
      "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
      "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
      "You will be able to reuse this secret in all of your notebooks.\n",
      "Please note that authentication is recommended but still optional to access public models or datasets.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e727f44f93ab497db18b9c5dc40ec2b6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2c2b5989a41f4773b49ea6a9a4112eaf",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c43e71cdba594e72bae3598d1ca68afe",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "README.md:   0%|          | 0.00/94.8k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "8d902adb9d834935a052f179b3b68f3c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d1e6f550b0bf4b1ab7dbbbcd907307c9",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "839a3fad85b64888a05795a96328b89e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9a25c2ef8c7743569b099792b6a45abb",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "dfc8130e4ef047229f7bf6d9fcf20540",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "bf7645e0c83a4fe0a41c5cfb667f8267",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e94a5f6f6db943ec82b204819ce3ab36",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ad82f13e097a43f2bc2d4c9a5f72d6e5",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "class VanillaDocuments(LanceModel):\n",
    "    vector: Vector(model.ndims()) = model.VectorField()  # Vector column, filled in automatically\n",
    "    raw_chunk: str = model.SourceField()  # Column whose text is embedded\n",
    "    doc_id: str  # The remaining fields are metadata\n",
    "    original_uuid: str\n",
    "    chunk_id: str\n",
    "    original_index: int\n",
    "    full_doc: str\n",
    "\n",
    "\n",
    "db = lancedb.connect(\"./db\")\n",
    "vanilla_table = db.create_table(\"vanilla_documents\", schema=VanillaDocuments)\n",
    "\n",
    "vanilla_table.add(raw_chunks)  # Ingest docs with auto-vectorization\n",
    "# Build the full-text search (FTS) index now so that BM25 can be used later\n",
    "vanilla_table.create_fts_index(\"raw_chunk\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "id": "nGucBjg0NmNE"
   },
   "outputs": [],
   "source": [
    "QUERY = \"implement corpus management with event handling\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 404
    },
    "id": "kQ9MggTd1nQU",
    "outputId": "e7061ad0-b7c9-4951-b4db-89ffbb443068"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.google.colaboratory.intrinsic+json": {
       "summary": "{\n  \"name\": \"            drop([\\\"vector\\\", \\\"original_uuid\\\"], axis = 1)\",\n  \"rows\": 3,\n  \"fields\": [\n    {\n      \"column\": \"raw_chunk\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"#[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \\\"tui\\\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \\\"tui\\\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedbacks::{CrashFeedback, MaxMapFeedback},\\n    fuzzer::{Fuzzer, StdFuzzer},\\n    inputs::{BytesInput, HasTargetBytes},\\n    mutators::{StdScheduledMutator, StringCategoryRandMutator, StringSubcategoryRandMutator},\\n    observers::StdMapObserver,\\n    schedulers::QueueScheduler,\\n    stages::{mutational::StdMutationalStage, StringIdentificationStage},\\n    state::StdState,\\n    Evaluator,\\n};\\nuse libafl_bolts::{current_nanos, rands::StdRand, tuples::tuple_list, AsSlice};\\n\\n\",\n          \"use core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};\\n\\nuse libafl::{\\n    events::EventFirer,\\n    executors::ExitKind,\\n    feedbacks::Feedback,\\n    inputs::UsesInput,\\n    observers::{Observer, ObserversTuple},\\n    state::State,\\n    Error,\\n};\\nuse libafl_bolts::Named;\\nuse libc::SIGABRT;\\nuse serde::{Deserialize, Serialize};\\n\\nextern \\\"C\\\" {\\n    fn libafl_check_malloc_size(ptr: *const c_void) -> usize;\\n}\\n\\nstatic RUNNING: AtomicBool = AtomicBool::new(false);\\nstatic OOMED: AtomicBool = AtomicBool::new(false);\\nstatic RSS_MAX: AtomicUsize = AtomicUsize::new(2 << 30);\\n// 2GB, which is the default\\nstatic MALLOC_MAX: AtomicUsize = AtomicUsize::new(2 << 30);\\n\\nstatic 
MALLOC_SIZE: AtomicUsize = AtomicUsize::new(0);\\n\\n\",\n          \"    // The Monitor trait define how the fuzzer stats are displayed to the user\\n    #[cfg(not(feature = \\\"tui\\\"))]\\n    let mon = SimpleMonitor::new(|s| println!(\\\"{s}\\\"));\\n    #[cfg(feature = \\\"tui\\\")]\\n    let ui = TuiUI::with_version(String::from(\\\"Baby Fuzzer\\\"), String::from(\\\"0.0.1\\\"), false);\\n    #[cfg(feature = \\\"tui\\\")]\\n    let mon = TuiMonitor::new(ui);\\n\\n    // The event manager handle the various events generated during the fuzzing loop\\n    // such as the notification of the addition of a new item to the corpus\\n    let mut mgr = SimpleEventManager::new(mon);\\n\\n    // A queue policy to get testcasess from the corpus\\n    let scheduler = QueueScheduler::new();\\n\\n    // A fuzzer with feedbacks and a corpus scheduler\\n    let mut fuzzer = StdFuzzer::new(scheduler, feedback, objective);\\n\\n\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"doc_3\",\n          \"doc_2\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"chunk_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_2_chunk_0\",\n          \"doc_3_chunk_0\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"original_index\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 2,\n        \"min\": 0,\n        \"max\": 4,\n        \"num_unique_values\": 2,\n        \"samples\": [\n          4,\n          0\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": 
\"full_doc\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"use core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};\\n\\nuse libafl::{\\n    events::EventFirer,\\n    executors::ExitKind,\\n    feedbacks::Feedback,\\n    inputs::UsesInput,\\n    observers::{Observer, ObserversTuple},\\n    state::State,\\n    Error,\\n};\\nuse libafl_bolts::Named;\\nuse libc::SIGABRT;\\nuse serde::{Deserialize, Serialize};\\n\\nextern \\\"C\\\" {\\n    fn libafl_check_malloc_size(ptr: *const c_void) -> usize;\\n}\\n\\nstatic RUNNING: AtomicBool = AtomicBool::new(false);\\nstatic OOMED: AtomicBool = AtomicBool::new(false);\\nstatic RSS_MAX: AtomicUsize = AtomicUsize::new(2 << 30);\\n// 2GB, which is the default\\nstatic MALLOC_MAX: AtomicUsize = AtomicUsize::new(2 << 30);\\n\\nstatic MALLOC_SIZE: AtomicUsize = AtomicUsize::new(0);\\n\\n/// malloc hook which will be invoked if address sanitizer is present. 
Used to detect if the target makes a malloc call\\n/// that will exceed the permissible size\\n///\\n/// # Safety\\n/// Is only safe to call with valid freshly allocated pointers backed by allocations of `size`.\\n#[no_mangle]\\npub unsafe extern \\\"C\\\" fn __sanitizer_malloc_hook(ptr: *const c_void, size: usize) {\\n    if RUNNING.load(Ordering::Relaxed) {\\n        let size = match unsafe { libafl_check_malloc_size(ptr) } {\\n            0 => size, // either the malloc size function didn't work or it's really zero-sized\\n            real => real,\\n        };\\n\\n        let total = MALLOC_SIZE.fetch_add(size, Ordering::Relaxed) + size;\\n        if (size > MALLOC_MAX.load(Ordering::Relaxed) || total > RSS_MAX.load(Ordering::Relaxed))\\n            && !OOMED.swap(true, Ordering::Relaxed)\\n        {\\n            unsafe {\\n                // we need to kill the process in a way that immediately triggers the crash handler\\n                libc::raise(SIGABRT);\\n            }\\n        }\\n    }\\n}\\n\\n/// free hook which will be invoked if ASAN is present. 
Used to detect if the target makes a malloc call that will\\n/// exceed the permissible size\\n///\\n/// # Safety\\n/// Is only safe to call with valid allocated pointers, about to be freed.\\n#[no_mangle]\\npub unsafe extern \\\"C\\\" fn __sanitizer_free_hook(ptr: *const c_void) {\\n    if RUNNING.load(Ordering::Relaxed) {\\n        let size = unsafe { libafl_check_malloc_size(ptr) };\\n        MALLOC_SIZE\\n            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |existing| {\\n                Some(existing.saturating_sub(size))\\n            })\\n            .expect(\\\"must complete successfully\\\");\\n    }\\n}\\n\\nconst OOM_OBS_NAME: &str = \\\"libfuzzer-like-oom\\\";\\n\\n/// Observer which detects if the target would run out of memory or otherwise violate the permissible usage of malloc\\n#[derive(Debug, Serialize, Deserialize)]\\npub struct OomObserver {\\n    oomed: bool,\\n}\\n\\nimpl OomObserver {\\n    /// Create a [`OomObserver`] with the provided `rss_max` (total heap size) and `malloc_max` (largest permissible malloc\\n    /// allocation size)\\n    pub fn new(rss_max: usize, malloc_max: usize) -> Self {\\n        RSS_MAX.store(rss_max, Ordering::Relaxed);\\n        MALLOC_MAX.store(malloc_max, Ordering::Relaxed);\\n        Self { oomed: false }\\n    }\\n}\\n\\nimpl Named for OomObserver {\\n    // strictly one name to prevent two from being registered\\n    fn name(&self) -> &str {\\n        OOM_OBS_NAME\\n    }\\n}\\n\\nimpl<S> Observer<S> for OomObserver\\nwhere\\n    S: UsesInput,\\n{\\n    fn pre_exec(&mut self, _state: &mut S, _input: &S::Input) -> Result<(), Error> {\\n        OOMED.store(false, Ordering::Relaxed);\\n        // must reset for platforms which do not offer malloc tracking\\n        MALLOC_SIZE.store(0, Ordering::Relaxed);\\n        RUNNING.store(true, Ordering::Relaxed);\\n        Ok(())\\n    }\\n\\n    fn post_exec(\\n        &mut self,\\n        _state: &mut S,\\n        _input: &S::Input,\\n        _exit_kind: 
&ExitKind,\\n    ) -> Result<(), Error> {\\n        RUNNING.store(false, Ordering::Relaxed);\\n        self.oomed = OOMED.load(Ordering::Relaxed);\\n        Ok(())\\n    }\\n\\n    fn pre_exec_child(&mut self, state: &mut S, input: &S::Input) -> Result<(), Error> {\\n        self.pre_exec(state, input)\\n    }\\n\\n    fn post_exec_child(\\n        &mut self,\\n        state: &mut S,\\n        input: &S::Input,\\n        exit_kind: &ExitKind,\\n    ) -> Result<(), Error> {\\n        self.post_exec(state, input, exit_kind)\\n    }\\n}\\n\\n/// Feedback for the similarly named [`OomObserver`] to detect if the target crashed due to an observed OOM\\n#[derive(Debug, Serialize, Deserialize, Copy, Clone, Default)]\\npub struct OomFeedback;\\n\\nimpl OomFeedback {\\n    /// Whether the target OOM'd in the last execution\\n    pub fn oomed() -> bool {\\n        OOMED.load(Ordering::Relaxed)\\n    }\\n}\\n\\nimpl Named for OomFeedback {\\n    fn name(&self) -> &str {\\n        \\\"oom\\\"\\n    }\\n}\\n\\nimpl<S> Feedback<S> for OomFeedback\\nwhere\\n    S: State,\\n{\\n    fn is_interesting<EM, OT>(\\n        &mut self,\\n        _state: &mut S,\\n        _manager: &mut EM,\\n        _input: &S::Input,\\n        _observers: &OT,\\n        _exit_kind: &ExitKind,\\n    ) -> Result<bool, Error>\\n    where\\n        EM: EventFirer<State = S>,\\n        OT: ObserversTuple<S>,\\n    {\\n        Ok(Self::oomed())\\n    }\\n}\\n\",\n          \"#[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \\\"tui\\\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \\\"tui\\\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedbacks::{CrashFeedback, MaxMapFeedback},\\n    fuzzer::{Fuzzer, StdFuzzer},\\n    inputs::{BytesInput, HasTargetBytes},\\n    
mutators::{StdScheduledMutator, StringCategoryRandMutator, StringSubcategoryRandMutator},\\n    observers::StdMapObserver,\\n    schedulers::QueueScheduler,\\n    stages::{mutational::StdMutationalStage, StringIdentificationStage},\\n    state::StdState,\\n    Evaluator,\\n};\\nuse libafl_bolts::{current_nanos, rands::StdRand, tuples::tuple_list, AsSlice};\\n\\n/// Coverage map with explicit assignments due to the lack of instrumentation\\nstatic mut SIGNALS: [u8; 64] = [0; 64];\\nstatic mut SIGNALS_PTR: *mut u8 = unsafe { SIGNALS.as_mut_ptr() };\\n\\n/// Assign a signal to the signals map\\nfn signals_set(idx: usize) {\\n    unsafe { write(SIGNALS_PTR.add(idx), 1) };\\n}\\n\\n#[allow(clippy::similar_names, clippy::manual_assert)]\\npub fn main() {\\n    // The closure that we want to fuzz\\n    let mut harness = |input: &BytesInput| {\\n        let target = input.target_bytes();\\n        let buf = target.as_slice();\\n        let goal = b\\\"abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz\\\";\\n        let mut i = 0;\\n        for _ in buf.iter().zip(goal).take_while(|(b, c)| b == c) {\\n            signals_set(i);\\n            i += 1;\\n        }\\n        if i == goal.len() {\\n            #[cfg(unix)]\\n            panic!(\\\"Artificial bug triggered =)\\\");\\n\\n            #[cfg(windows)]\\n            unsafe {\\n                write_volatile(0 as *mut u32, 0);\\n            }\\n        }\\n        ExitKind::Ok\\n    };\\n\\n    // Create an observation channel using the signals map\\n    let observer = unsafe { StdMapObserver::from_mut_ptr(\\\"signals\\\", SIGNALS_PTR, SIGNALS.len()) };\\n\\n    // Feedback to rate the interestingness of an input\\n    let mut feedback = MaxMapFeedback::new(&observer);\\n\\n    // A feedback to choose if an input is a solution or not\\n    let mut objective = CrashFeedback::new();\\n\\n    // create a State from scratch\\n    let mut state = StdState::new(\\n        // RNG\\n        
StdRand::with_seed(current_nanos()),\\n        // Corpus that will be evolved, we keep it in memory for performance\\n        InMemoryCorpus::new(),\\n        // Corpus in which we store solutions (crashes in this example),\\n        // on disk so the user can get them after stopping the fuzzer\\n        OnDiskCorpus::new(PathBuf::from(\\\"./crashes\\\")).unwrap(),\\n        // States of the feedbacks.\\n        // The feedbacks can report the data that should persist in the State.\\n        &mut feedback,\\n        // Same for objective feedbacks\\n        &mut objective,\\n    )\\n    .unwrap();\\n\\n    // The Monitor trait define how the fuzzer stats are displayed to the user\\n    #[cfg(not(feature = \\\"tui\\\"))]\\n    let mon = SimpleMonitor::new(|s| println!(\\\"{s}\\\"));\\n    #[cfg(feature = \\\"tui\\\")]\\n    let ui = TuiUI::with_version(String::from(\\\"Baby Fuzzer\\\"), String::from(\\\"0.0.1\\\"), false);\\n    #[cfg(feature = \\\"tui\\\")]\\n    let mon = TuiMonitor::new(ui);\\n\\n    // The event manager handle the various events generated during the fuzzing loop\\n    // such as the notification of the addition of a new item to the corpus\\n    let mut mgr = SimpleEventManager::new(mon);\\n\\n    // A queue policy to get testcasess from the corpus\\n    let scheduler = QueueScheduler::new();\\n\\n    // A fuzzer with feedbacks and a corpus scheduler\\n    let mut fuzzer = StdFuzzer::new(scheduler, feedback, objective);\\n\\n    // Create the executor for an in-process function with just one observer\\n    let mut executor = InProcessExecutor::new(\\n        &mut harness,\\n        tuple_list!(observer),\\n        &mut fuzzer,\\n        &mut state,\\n        &mut mgr,\\n    )\\n    .expect(\\\"Failed to create the Executor\\\");\\n\\n    // Generate 8 initial inputs\\n    fuzzer\\n        .evaluate_input(\\n            &mut state,\\n            &mut executor,\\n            &mut mgr,\\n            BytesInput::new(vec![b'a']),\\n        )\\n        
.unwrap();\\n\\n    // Setup a mutational stage with a basic bytes mutator\\n    let mutator = StdScheduledMutator::new(tuple_list!(\\n        StringCategoryRandMutator,\\n        StringSubcategoryRandMutator,\\n        StringSubcategoryRandMutator,\\n        StringSubcategoryRandMutator,\\n        StringSubcategoryRandMutator\\n    ));\\n    let mut stages = tuple_list!(\\n        StringIdentificationStage::new(),\\n        StdMutationalStage::transforming(mutator)\\n    );\\n\\n    fuzzer\\n        .fuzz_loop(&mut stages, &mut executor, &mut state, &mut mgr)\\n        .expect(\\\"Error in the fuzzing loop\\\");\\n}\\n\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"_relevance_score\",\n      \"properties\": {\n        \"dtype\": \"float32\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          0.016393441706895828,\n          0.0320020467042923\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
       "type": "dataframe"
      },
      "text/html": [
       "\n",
       "  <div id=\"df-e3784548-96c9-4b68-a488-f97f27c6e27f\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>raw_chunk</th>\n",
       "      <th>doc_id</th>\n",
       "      <th>chunk_id</th>\n",
       "      <th>original_index</th>\n",
       "      <th>full_doc</th>\n",
       "      <th>_relevance_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>#[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \"tui\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \"tui\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedba...</td>\n",
       "      <td>doc_2</td>\n",
       "      <td>doc_2_chunk_0</td>\n",
       "      <td>0</td>\n",
       "      <td>#[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \"tui\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \"tui\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedba...</td>\n",
       "      <td>0.032002</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>use core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};\\n\\nuse libafl::{\\n    events::EventFirer,\\n    executors::ExitKind,\\n    feedbacks::Feedback,\\n    inputs::UsesInput,\\n    observers::{Observer, ObserversTuple},\\n    state::State,\\n    Error,\\n};\\nuse libafl_bolts::Named;\\nuse libc::SIGABRT;\\nuse serde::{Deserialize, Serialize};\\n\\nextern \"C\" {\\n...</td>\n",
       "      <td>doc_3</td>\n",
       "      <td>doc_3_chunk_0</td>\n",
       "      <td>0</td>\n",
       "      <td>use core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};\\n\\nuse libafl::{\\n    events::EventFirer,\\n    executors::ExitKind,\\n    feedbacks::Feedback,\\n    inputs::UsesInput,\\n    observers::{Observer, ObserversTuple},\\n    state::State,\\n    Error,\\n};\\nuse libafl_bolts::Named;\\nuse libc::SIGABRT;\\nuse serde::{Deserialize, Serialize};\\n\\nextern \"C\" {\\n...</td>\n",
       "      <td>0.016393</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>// The Monitor trait define how the fuzzer stats are displayed to the user\\n    #[cfg(not(feature = \"tui\"))]\\n    let mon = SimpleMonitor::new(|s| println!(\"{s}\"));\\n    #[cfg(feature = \"tui\")]\\n    let ui = TuiUI::with_version(String::from(\"Baby Fuzzer\"), String::from(\"0.0.1\"), false);\\n    #[cfg(feature = \"tui\")]\\n    let mon = TuiMonitor::new(ui);\\n\\n    // The event manager handle the ...</td>\n",
       "      <td>doc_2</td>\n",
       "      <td>doc_2_chunk_4</td>\n",
       "      <td>4</td>\n",
       "      <td>#[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \"tui\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \"tui\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedba...</td>\n",
       "      <td>0.016393</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "    <div class=\"colab-df-buttons\">\n",
       "\n",
       "  <div class=\"colab-df-container\">\n",
       "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-e3784548-96c9-4b68-a488-f97f27c6e27f')\"\n",
       "            title=\"Convert this dataframe to an interactive table.\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
       "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
       "  </svg>\n",
       "    </button>\n",
       "\n",
       "  <style>\n",
       "    .colab-df-container {\n",
       "      display:flex;\n",
       "      gap: 12px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert {\n",
       "      background-color: #E8F0FE;\n",
       "      border: none;\n",
       "      border-radius: 50%;\n",
       "      cursor: pointer;\n",
       "      display: none;\n",
       "      fill: #1967D2;\n",
       "      height: 32px;\n",
       "      padding: 0 0 0 0;\n",
       "      width: 32px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert:hover {\n",
       "      background-color: #E2EBFA;\n",
       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "      fill: #174EA6;\n",
       "    }\n",
       "\n",
       "    .colab-df-buttons div {\n",
       "      margin-bottom: 4px;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert {\n",
       "      background-color: #3B4455;\n",
       "      fill: #D2E3FC;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert:hover {\n",
       "      background-color: #434B5C;\n",
       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
       "      fill: #FFFFFF;\n",
       "    }\n",
       "  </style>\n",
       "\n",
       "    <script>\n",
       "      const buttonEl =\n",
       "        document.querySelector('#df-e3784548-96c9-4b68-a488-f97f27c6e27f button.colab-df-convert');\n",
       "      buttonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "\n",
       "      async function convertToInteractive(key) {\n",
       "        const element = document.querySelector('#df-e3784548-96c9-4b68-a488-f97f27c6e27f');\n",
       "        const dataTable =\n",
       "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
       "                                                    [key], {});\n",
       "        if (!dataTable) return;\n",
       "\n",
       "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
       "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
       "          + ' to learn more about interactive tables.';\n",
       "        element.innerHTML = '';\n",
       "        dataTable['output_type'] = 'display_data';\n",
       "        await google.colab.output.renderOutput(dataTable, element);\n",
       "        const docLink = document.createElement('div');\n",
       "        docLink.innerHTML = docLinkHtml;\n",
       "        element.appendChild(docLink);\n",
       "      }\n",
       "    </script>\n",
       "  </div>\n",
       "\n",
       "\n",
       "<div id=\"df-144a2d93-f2a9-4705-8710-2364b525df27\">\n",
       "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-144a2d93-f2a9-4705-8710-2364b525df27')\"\n",
       "            title=\"Suggest charts\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
       "     width=\"24px\">\n",
       "    <g>\n",
       "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
       "    </g>\n",
       "</svg>\n",
       "  </button>\n",
       "\n",
       "<style>\n",
       "  .colab-df-quickchart {\n",
       "      --bg-color: #E8F0FE;\n",
       "      --fill-color: #1967D2;\n",
       "      --hover-bg-color: #E2EBFA;\n",
       "      --hover-fill-color: #174EA6;\n",
       "      --disabled-fill-color: #AAA;\n",
       "      --disabled-bg-color: #DDD;\n",
       "  }\n",
       "\n",
       "  [theme=dark] .colab-df-quickchart {\n",
       "      --bg-color: #3B4455;\n",
       "      --fill-color: #D2E3FC;\n",
       "      --hover-bg-color: #434B5C;\n",
       "      --hover-fill-color: #FFFFFF;\n",
       "      --disabled-bg-color: #3B4455;\n",
       "      --disabled-fill-color: #666;\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart {\n",
       "    background-color: var(--bg-color);\n",
       "    border: none;\n",
       "    border-radius: 50%;\n",
       "    cursor: pointer;\n",
       "    display: none;\n",
       "    fill: var(--fill-color);\n",
       "    height: 32px;\n",
       "    padding: 0;\n",
       "    width: 32px;\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart:hover {\n",
       "    background-color: var(--hover-bg-color);\n",
       "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "    fill: var(--button-hover-fill-color);\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart-complete:disabled,\n",
       "  .colab-df-quickchart-complete:disabled:hover {\n",
       "    background-color: var(--disabled-bg-color);\n",
       "    fill: var(--disabled-fill-color);\n",
       "    box-shadow: none;\n",
       "  }\n",
       "\n",
       "  .colab-df-spinner {\n",
       "    border: 2px solid var(--fill-color);\n",
       "    border-color: transparent;\n",
       "    border-bottom-color: var(--fill-color);\n",
       "    animation:\n",
       "      spin 1s steps(1) infinite;\n",
       "  }\n",
       "\n",
       "  @keyframes spin {\n",
       "    0% {\n",
       "      border-color: transparent;\n",
       "      border-bottom-color: var(--fill-color);\n",
       "      border-left-color: var(--fill-color);\n",
       "    }\n",
       "    20% {\n",
       "      border-color: transparent;\n",
       "      border-left-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "    }\n",
       "    30% {\n",
       "      border-color: transparent;\n",
       "      border-left-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "      border-right-color: var(--fill-color);\n",
       "    }\n",
       "    40% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "    }\n",
       "    60% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "    }\n",
       "    80% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "      border-bottom-color: var(--fill-color);\n",
       "    }\n",
       "    90% {\n",
       "      border-color: transparent;\n",
       "      border-bottom-color: var(--fill-color);\n",
       "    }\n",
       "  }\n",
       "</style>\n",
       "\n",
       "  <script>\n",
       "    async function quickchart(key) {\n",
       "      const quickchartButtonEl =\n",
       "        document.querySelector('#' + key + ' button');\n",
       "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
       "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
       "      try {\n",
       "        const charts = await google.colab.kernel.invokeFunction(\n",
       "            'suggestCharts', [key], {});\n",
       "      } catch (error) {\n",
       "        console.error('Error during call to suggestCharts:', error);\n",
       "      }\n",
       "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
       "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
       "    }\n",
       "    (() => {\n",
       "      let quickchartButtonEl =\n",
       "        document.querySelector('#df-144a2d93-f2a9-4705-8710-2364b525df27 button');\n",
       "      quickchartButtonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "    })();\n",
       "  </script>\n",
       "</div>\n",
       "\n",
       "    </div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "                                                                                                                                                                                                                                                                                                                                                                                                         raw_chunk  \\\n",
       "0  #[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \"tui\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \"tui\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedba...   \n",
       "1  use core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};\\n\\nuse libafl::{\\n    events::EventFirer,\\n    executors::ExitKind,\\n    feedbacks::Feedback,\\n    inputs::UsesInput,\\n    observers::{Observer, ObserversTuple},\\n    state::State,\\n    Error,\\n};\\nuse libafl_bolts::Named;\\nuse libc::SIGABRT;\\nuse serde::{Deserialize, Serialize};\\n\\nextern \"C\" {\\n...   \n",
       "2      // The Monitor trait define how the fuzzer stats are displayed to the user\\n    #[cfg(not(feature = \"tui\"))]\\n    let mon = SimpleMonitor::new(|s| println!(\"{s}\"));\\n    #[cfg(feature = \"tui\")]\\n    let ui = TuiUI::with_version(String::from(\"Baby Fuzzer\"), String::from(\"0.0.1\"), false);\\n    #[cfg(feature = \"tui\")]\\n    let mon = TuiMonitor::new(ui);\\n\\n    // The event manager handle the ...   \n",
       "\n",
       "  doc_id       chunk_id  original_index  \\\n",
       "0  doc_2  doc_2_chunk_0               0   \n",
       "1  doc_3  doc_3_chunk_0               0   \n",
       "2  doc_2  doc_2_chunk_4               4   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                                                                                                                          full_doc  \\\n",
       "0  #[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \"tui\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \"tui\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedba...   \n",
       "1  use core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};\\n\\nuse libafl::{\\n    events::EventFirer,\\n    executors::ExitKind,\\n    feedbacks::Feedback,\\n    inputs::UsesInput,\\n    observers::{Observer, ObserversTuple},\\n    state::State,\\n    Error,\\n};\\nuse libafl_bolts::Named;\\nuse libc::SIGABRT;\\nuse serde::{Deserialize, Serialize};\\n\\nextern \"C\" {\\n...   \n",
       "2  #[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \"tui\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \"tui\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedba...   \n",
       "\n",
       "   _relevance_score  \n",
       "0          0.032002  \n",
       "1          0.016393  \n",
       "2          0.016393  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vanilla_table.search(QUERY, query_type=\"hybrid\").limit(3).to_pandas().drop(\n",
    "    [\"vector\", \"original_uuid\"], axis=1\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HoGDypAnC5ZS"
   },
   "source": [
    "## Contextual Retrieval with Prompt Caching"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "id": "YGuX1MAjIsA0"
   },
   "outputs": [],
   "source": [
    "def create_context_prompt(full_document_text, chunk_text):\n",
    "    prompt = f\"\"\"\n",
    "<document>\n",
    "{full_document_text}\n",
    "</document>\n",
    "\n",
    "Here is the chunk we want to situate within the whole document\n",
    "<chunk>\n",
    "{chunk_text}\n",
    "</chunk>\n",
    "\n",
    "Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk.\n",
    "Answer only with the succinct context and nothing else.\n",
    "\"\"\"\n",
    "    return (\n",
    "        prompt,\n",
    "        gpt_client.chat.completions.create(\n",
    "            model=\"gpt-4o-mini\", messages=[{\"role\": \"user\", \"content\": prompt}]\n",
    "        )\n",
    "        .choices[0]\n",
    "        .message.content.strip(),\n",
    "    )\n",
    "\n",
    "\n",
    "for chunk in raw_chunks:\n",
    "    prompt, response = create_context_prompt(chunk[\"full_doc\"], chunk[\"raw_chunk\"])\n",
    "    chunk[\"prompt\"] = prompt\n",
    "    chunk[\"chunk_context\"] = response\n",
    "    chunk[\"chunk_with_context\"] = chunk[\"chunk_context\"] + \"\\n\" + chunk[\"raw_chunk\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "id": "rCVdPLYm_oig"
   },
   "outputs": [],
   "source": [
    "class Documents(LanceModel):\n",
    "    vector: Vector(model.ndims()) = model.VectorField()  # Default field\n",
    "    text: str = (\n",
    "        model.SourceField()\n",
    "    )  # the Columns (field) in DB whose Embedding we'll create\n",
    "    doc_id: str  # rest is just metadata below\n",
    "    raw_chunk: str\n",
    "    full_doc: str\n",
    "    original_uuid: str\n",
    "    chunk_id: str\n",
    "    original_index: int\n",
    "\n",
    "\n",
    "KEYS = [\n",
    "    \"raw_chunk\",\n",
    "    \"full_doc\",\n",
    "    \"doc_id\",\n",
    "    \"original_uuid\",\n",
    "    \"chunk_id\",\n",
    "    \"original_index\",\n",
    "]\n",
    "\n",
    "context_documents = []\n",
    "for chunk in raw_chunks:\n",
    "    temp = {\n",
    "        \"text\": chunk[\"chunk_with_context\"]\n",
    "    }  # Create embedding from 'text' field which is (Chunk_Context_i + Chunk_i)\n",
    "\n",
    "    for key in KEYS:\n",
    "        temp[key] = chunk[key]  # Get other metadata\n",
    "    context_documents.append(temp)\n",
    "\n",
    "\n",
    "context_table = db.create_table(\"added_context_table\", schema=Documents)\n",
    "\n",
    "context_table.add(context_documents)  # ingest docs with auto-vectorization\n",
    "context_table.create_fts_index(\n",
    "    \"text\"\n",
    ")  # Create a fts index before so that we can use BM-25 later"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_9FcV2EXL0KI"
   },
   "source": [
    "Let's search with Contextual Retrieval and see the difference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 647
    },
    "id": "c1IqsxywNt3t",
    "outputId": "1ec103b2-50ee-47f3-a8d7-dd31b23d3745"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.google.colaboratory.intrinsic+json": {
       "summary": "{\n  \"name\": \"            drop([\\\"vector\\\", \\\"original_uuid\\\"], axis = 1)\",\n  \"rows\": 3,\n  \"fields\": [\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"This chunk is part of the main function in a fuzzing application, specifically focusing on the setup of the monitor for displaying fuzzer statistics and the event manager for handling events during the fuzzing loop. It follows the initialization of state, feedback mechanisms, and sets up the fuzzer with a scheduling policy for managing test cases from the corpus.\\n    // The Monitor trait define how the fuzzer stats are displayed to the user\\n    #[cfg(not(feature = \\\"tui\\\"))]\\n    let mon = SimpleMonitor::new(|s| println!(\\\"{s}\\\"));\\n    #[cfg(feature = \\\"tui\\\")]\\n    let ui = TuiUI::with_version(String::from(\\\"Baby Fuzzer\\\"), String::from(\\\"0.0.1\\\"), false);\\n    #[cfg(feature = \\\"tui\\\")]\\n    let mon = TuiMonitor::new(ui);\\n\\n    // The event manager handle the various events generated during the fuzzing loop\\n    // such as the notification of the addition of a new item to the corpus\\n    let mut mgr = SimpleEventManager::new(mon);\\n\\n    // A queue policy to get testcasess from the corpus\\n    let scheduler = QueueScheduler::new();\\n\\n    // A fuzzer with feedbacks and a corpus scheduler\\n    let mut fuzzer = StdFuzzer::new(scheduler, feedback, objective);\\n\\n\",\n          \"The chunk contains Rust code that includes the necessary imports and configurations for a fuzzing framework using the libafl library. 
It sets up the environment for the fuzzer, including the configuration for different operating systems and features, and imports various modules required for corpus management, executors, feedback mechanisms, and mutators, establishing the foundational components before the main fuzzing logic is implemented.\\n#[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \\\"tui\\\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \\\"tui\\\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedbacks::{CrashFeedback, MaxMapFeedback},\\n    fuzzer::{Fuzzer, StdFuzzer},\\n    inputs::{BytesInput, HasTargetBytes},\\n    mutators::{StdScheduledMutator, StringCategoryRandMutator, StringSubcategoryRandMutator},\\n    observers::StdMapObserver,\\n    schedulers::QueueScheduler,\\n    stages::{mutational::StdMutationalStage, StringIdentificationStage},\\n    state::StdState,\\n    Evaluator,\\n};\\nuse libafl_bolts::{current_nanos, rands::StdRand, tuples::tuple_list, AsSlice};\\n\\n\",\n          \"The chunk contains necessary imports, declarations of static atomic variables, and the definition of an external C function, which are foundational components for managing memory allocation tracking in the context of a memory monitoring system within a Rust program that utilizes the libafl library for fuzzing.\\nuse core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};\\n\\nuse libafl::{\\n    events::EventFirer,\\n    executors::ExitKind,\\n    feedbacks::Feedback,\\n    inputs::UsesInput,\\n    observers::{Observer, ObserversTuple},\\n    state::State,\\n    Error,\\n};\\nuse libafl_bolts::Named;\\nuse libc::SIGABRT;\\nuse serde::{Deserialize, Serialize};\\n\\nextern \\\"C\\\" {\\n    fn 
libafl_check_malloc_size(ptr: *const c_void) -> usize;\\n}\\n\\nstatic RUNNING: AtomicBool = AtomicBool::new(false);\\nstatic OOMED: AtomicBool = AtomicBool::new(false);\\nstatic RSS_MAX: AtomicUsize = AtomicUsize::new(2 << 30);\\n// 2GB, which is the default\\nstatic MALLOC_MAX: AtomicUsize = AtomicUsize::new(2 << 30);\\n\\nstatic MALLOC_SIZE: AtomicUsize = AtomicUsize::new(0);\\n\\n\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"doc_3\",\n          \"doc_2\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"raw_chunk\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"    // The Monitor trait define how the fuzzer stats are displayed to the user\\n    #[cfg(not(feature = \\\"tui\\\"))]\\n    let mon = SimpleMonitor::new(|s| println!(\\\"{s}\\\"));\\n    #[cfg(feature = \\\"tui\\\")]\\n    let ui = TuiUI::with_version(String::from(\\\"Baby Fuzzer\\\"), String::from(\\\"0.0.1\\\"), false);\\n    #[cfg(feature = \\\"tui\\\")]\\n    let mon = TuiMonitor::new(ui);\\n\\n    // The event manager handle the various events generated during the fuzzing loop\\n    // such as the notification of the addition of a new item to the corpus\\n    let mut mgr = SimpleEventManager::new(mon);\\n\\n    // A queue policy to get testcasess from the corpus\\n    let scheduler = QueueScheduler::new();\\n\\n    // A fuzzer with feedbacks and a corpus scheduler\\n    let mut fuzzer = StdFuzzer::new(scheduler, feedback, objective);\\n\\n\",\n          \"#[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \\\"tui\\\")]\\nuse libafl::monitors::tui::{ui::TuiUI, 
TuiMonitor};\\n#[cfg(not(feature = \\\"tui\\\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedbacks::{CrashFeedback, MaxMapFeedback},\\n    fuzzer::{Fuzzer, StdFuzzer},\\n    inputs::{BytesInput, HasTargetBytes},\\n    mutators::{StdScheduledMutator, StringCategoryRandMutator, StringSubcategoryRandMutator},\\n    observers::StdMapObserver,\\n    schedulers::QueueScheduler,\\n    stages::{mutational::StdMutationalStage, StringIdentificationStage},\\n    state::StdState,\\n    Evaluator,\\n};\\nuse libafl_bolts::{current_nanos, rands::StdRand, tuples::tuple_list, AsSlice};\\n\\n\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"full_doc\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"use core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};\\n\\nuse libafl::{\\n    events::EventFirer,\\n    executors::ExitKind,\\n    feedbacks::Feedback,\\n    inputs::UsesInput,\\n    observers::{Observer, ObserversTuple},\\n    state::State,\\n    Error,\\n};\\nuse libafl_bolts::Named;\\nuse libc::SIGABRT;\\nuse serde::{Deserialize, Serialize};\\n\\nextern \\\"C\\\" {\\n    fn libafl_check_malloc_size(ptr: *const c_void) -> usize;\\n}\\n\\nstatic RUNNING: AtomicBool = AtomicBool::new(false);\\nstatic OOMED: AtomicBool = AtomicBool::new(false);\\nstatic RSS_MAX: AtomicUsize = AtomicUsize::new(2 << 30);\\n// 2GB, which is the default\\nstatic MALLOC_MAX: AtomicUsize = AtomicUsize::new(2 << 30);\\n\\nstatic MALLOC_SIZE: AtomicUsize = AtomicUsize::new(0);\\n\\n/// malloc hook which will be invoked if address sanitizer is present. 
Used to detect if the target makes a malloc call\\n/// that will exceed the permissible size\\n///\\n/// # Safety\\n/// Is only safe to call with valid freshly allocated pointers backed by allocations of `size`.\\n#[no_mangle]\\npub unsafe extern \\\"C\\\" fn __sanitizer_malloc_hook(ptr: *const c_void, size: usize) {\\n    if RUNNING.load(Ordering::Relaxed) {\\n        let size = match unsafe { libafl_check_malloc_size(ptr) } {\\n            0 => size, // either the malloc size function didn't work or it's really zero-sized\\n            real => real,\\n        };\\n\\n        let total = MALLOC_SIZE.fetch_add(size, Ordering::Relaxed) + size;\\n        if (size > MALLOC_MAX.load(Ordering::Relaxed) || total > RSS_MAX.load(Ordering::Relaxed))\\n            && !OOMED.swap(true, Ordering::Relaxed)\\n        {\\n            unsafe {\\n                // we need to kill the process in a way that immediately triggers the crash handler\\n                libc::raise(SIGABRT);\\n            }\\n        }\\n    }\\n}\\n\\n/// free hook which will be invoked if ASAN is present. 
Used to detect if the target makes a malloc call that will\\n/// exceed the permissible size\\n///\\n/// # Safety\\n/// Is only safe to call with valid allocated pointers, about to be freed.\\n#[no_mangle]\\npub unsafe extern \\\"C\\\" fn __sanitizer_free_hook(ptr: *const c_void) {\\n    if RUNNING.load(Ordering::Relaxed) {\\n        let size = unsafe { libafl_check_malloc_size(ptr) };\\n        MALLOC_SIZE\\n            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |existing| {\\n                Some(existing.saturating_sub(size))\\n            })\\n            .expect(\\\"must complete successfully\\\");\\n    }\\n}\\n\\nconst OOM_OBS_NAME: &str = \\\"libfuzzer-like-oom\\\";\\n\\n/// Observer which detects if the target would run out of memory or otherwise violate the permissible usage of malloc\\n#[derive(Debug, Serialize, Deserialize)]\\npub struct OomObserver {\\n    oomed: bool,\\n}\\n\\nimpl OomObserver {\\n    /// Create a [`OomObserver`] with the provided `rss_max` (total heap size) and `malloc_max` (largest permissible malloc\\n    /// allocation size)\\n    pub fn new(rss_max: usize, malloc_max: usize) -> Self {\\n        RSS_MAX.store(rss_max, Ordering::Relaxed);\\n        MALLOC_MAX.store(malloc_max, Ordering::Relaxed);\\n        Self { oomed: false }\\n    }\\n}\\n\\nimpl Named for OomObserver {\\n    // strictly one name to prevent two from being registered\\n    fn name(&self) -> &str {\\n        OOM_OBS_NAME\\n    }\\n}\\n\\nimpl<S> Observer<S> for OomObserver\\nwhere\\n    S: UsesInput,\\n{\\n    fn pre_exec(&mut self, _state: &mut S, _input: &S::Input) -> Result<(), Error> {\\n        OOMED.store(false, Ordering::Relaxed);\\n        // must reset for platforms which do not offer malloc tracking\\n        MALLOC_SIZE.store(0, Ordering::Relaxed);\\n        RUNNING.store(true, Ordering::Relaxed);\\n        Ok(())\\n    }\\n\\n    fn post_exec(\\n        &mut self,\\n        _state: &mut S,\\n        _input: &S::Input,\\n        _exit_kind: 
&ExitKind,\\n    ) -> Result<(), Error> {\\n        RUNNING.store(false, Ordering::Relaxed);\\n        self.oomed = OOMED.load(Ordering::Relaxed);\\n        Ok(())\\n    }\\n\\n    fn pre_exec_child(&mut self, state: &mut S, input: &S::Input) -> Result<(), Error> {\\n        self.pre_exec(state, input)\\n    }\\n\\n    fn post_exec_child(\\n        &mut self,\\n        state: &mut S,\\n        input: &S::Input,\\n        exit_kind: &ExitKind,\\n    ) -> Result<(), Error> {\\n        self.post_exec(state, input, exit_kind)\\n    }\\n}\\n\\n/// Feedback for the similarly named [`OomObserver`] to detect if the target crashed due to an observed OOM\\n#[derive(Debug, Serialize, Deserialize, Copy, Clone, Default)]\\npub struct OomFeedback;\\n\\nimpl OomFeedback {\\n    /// Whether the target OOM'd in the last execution\\n    pub fn oomed() -> bool {\\n        OOMED.load(Ordering::Relaxed)\\n    }\\n}\\n\\nimpl Named for OomFeedback {\\n    fn name(&self) -> &str {\\n        \\\"oom\\\"\\n    }\\n}\\n\\nimpl<S> Feedback<S> for OomFeedback\\nwhere\\n    S: State,\\n{\\n    fn is_interesting<EM, OT>(\\n        &mut self,\\n        _state: &mut S,\\n        _manager: &mut EM,\\n        _input: &S::Input,\\n        _observers: &OT,\\n        _exit_kind: &ExitKind,\\n    ) -> Result<bool, Error>\\n    where\\n        EM: EventFirer<State = S>,\\n        OT: ObserversTuple<S>,\\n    {\\n        Ok(Self::oomed())\\n    }\\n}\\n\",\n          \"#[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \\\"tui\\\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \\\"tui\\\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedbacks::{CrashFeedback, MaxMapFeedback},\\n    fuzzer::{Fuzzer, StdFuzzer},\\n    inputs::{BytesInput, HasTargetBytes},\\n    
mutators::{StdScheduledMutator, StringCategoryRandMutator, StringSubcategoryRandMutator},\\n    observers::StdMapObserver,\\n    schedulers::QueueScheduler,\\n    stages::{mutational::StdMutationalStage, StringIdentificationStage},\\n    state::StdState,\\n    Evaluator,\\n};\\nuse libafl_bolts::{current_nanos, rands::StdRand, tuples::tuple_list, AsSlice};\\n\\n/// Coverage map with explicit assignments due to the lack of instrumentation\\nstatic mut SIGNALS: [u8; 64] = [0; 64];\\nstatic mut SIGNALS_PTR: *mut u8 = unsafe { SIGNALS.as_mut_ptr() };\\n\\n/// Assign a signal to the signals map\\nfn signals_set(idx: usize) {\\n    unsafe { write(SIGNALS_PTR.add(idx), 1) };\\n}\\n\\n#[allow(clippy::similar_names, clippy::manual_assert)]\\npub fn main() {\\n    // The closure that we want to fuzz\\n    let mut harness = |input: &BytesInput| {\\n        let target = input.target_bytes();\\n        let buf = target.as_slice();\\n        let goal = b\\\"abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz\\\";\\n        let mut i = 0;\\n        for _ in buf.iter().zip(goal).take_while(|(b, c)| b == c) {\\n            signals_set(i);\\n            i += 1;\\n        }\\n        if i == goal.len() {\\n            #[cfg(unix)]\\n            panic!(\\\"Artificial bug triggered =)\\\");\\n\\n            #[cfg(windows)]\\n            unsafe {\\n                write_volatile(0 as *mut u32, 0);\\n            }\\n        }\\n        ExitKind::Ok\\n    };\\n\\n    // Create an observation channel using the signals map\\n    let observer = unsafe { StdMapObserver::from_mut_ptr(\\\"signals\\\", SIGNALS_PTR, SIGNALS.len()) };\\n\\n    // Feedback to rate the interestingness of an input\\n    let mut feedback = MaxMapFeedback::new(&observer);\\n\\n    // A feedback to choose if an input is a solution or not\\n    let mut objective = CrashFeedback::new();\\n\\n    // create a State from scratch\\n    let mut state = StdState::new(\\n        // RNG\\n        
StdRand::with_seed(current_nanos()),\\n        // Corpus that will be evolved, we keep it in memory for performance\\n        InMemoryCorpus::new(),\\n        // Corpus in which we store solutions (crashes in this example),\\n        // on disk so the user can get them after stopping the fuzzer\\n        OnDiskCorpus::new(PathBuf::from(\\\"./crashes\\\")).unwrap(),\\n        // States of the feedbacks.\\n        // The feedbacks can report the data that should persist in the State.\\n        &mut feedback,\\n        // Same for objective feedbacks\\n        &mut objective,\\n    )\\n    .unwrap();\\n\\n    // The Monitor trait define how the fuzzer stats are displayed to the user\\n    #[cfg(not(feature = \\\"tui\\\"))]\\n    let mon = SimpleMonitor::new(|s| println!(\\\"{s}\\\"));\\n    #[cfg(feature = \\\"tui\\\")]\\n    let ui = TuiUI::with_version(String::from(\\\"Baby Fuzzer\\\"), String::from(\\\"0.0.1\\\"), false);\\n    #[cfg(feature = \\\"tui\\\")]\\n    let mon = TuiMonitor::new(ui);\\n\\n    // The event manager handle the various events generated during the fuzzing loop\\n    // such as the notification of the addition of a new item to the corpus\\n    let mut mgr = SimpleEventManager::new(mon);\\n\\n    // A queue policy to get testcasess from the corpus\\n    let scheduler = QueueScheduler::new();\\n\\n    // A fuzzer with feedbacks and a corpus scheduler\\n    let mut fuzzer = StdFuzzer::new(scheduler, feedback, objective);\\n\\n    // Create the executor for an in-process function with just one observer\\n    let mut executor = InProcessExecutor::new(\\n        &mut harness,\\n        tuple_list!(observer),\\n        &mut fuzzer,\\n        &mut state,\\n        &mut mgr,\\n    )\\n    .expect(\\\"Failed to create the Executor\\\");\\n\\n    // Generate 8 initial inputs\\n    fuzzer\\n        .evaluate_input(\\n            &mut state,\\n            &mut executor,\\n            &mut mgr,\\n            BytesInput::new(vec![b'a']),\\n        )\\n        
.unwrap();\\n\\n    // Setup a mutational stage with a basic bytes mutator\\n    let mutator = StdScheduledMutator::new(tuple_list!(\\n        StringCategoryRandMutator,\\n        StringSubcategoryRandMutator,\\n        StringSubcategoryRandMutator,\\n        StringSubcategoryRandMutator,\\n        StringSubcategoryRandMutator\\n    ));\\n    let mut stages = tuple_list!(\\n        StringIdentificationStage::new(),\\n        StdMutationalStage::transforming(mutator)\\n    );\\n\\n    fuzzer\\n        .fuzz_loop(&mut stages, &mut executor, &mut state, &mut mgr)\\n        .expect(\\\"Error in the fuzzing loop\\\");\\n}\\n\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"chunk_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_2_chunk_4\",\n          \"doc_2_chunk_0\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"original_index\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 2,\n        \"min\": 0,\n        \"max\": 4,\n        \"num_unique_values\": 2,\n        \"samples\": [\n          0,\n          4\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"_relevance_score\",\n      \"properties\": {\n        \"dtype\": \"float32\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.032786883413791656,\n          0.032258063554763794\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
       "type": "dataframe"
      },
      "text/html": [
       "\n",
       "  <div id=\"df-55f8ab0a-c7e1-4f5b-9270-14cc09233ed8\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>doc_id</th>\n",
       "      <th>raw_chunk</th>\n",
       "      <th>full_doc</th>\n",
       "      <th>chunk_id</th>\n",
       "      <th>original_index</th>\n",
       "      <th>_relevance_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>This chunk is part of the main function in a fuzzing application, specifically focusing on the setup of the monitor for displaying fuzzer statistics and the event manager for handling events during the fuzzing loop. It follows the initialization of state, feedback mechanisms, and sets up the fuzzer with a scheduling policy for managing test cases from the corpus.\\n    // The Monitor trait defi...</td>\n",
       "      <td>doc_2</td>\n",
       "      <td>// The Monitor trait define how the fuzzer stats are displayed to the user\\n    #[cfg(not(feature = \"tui\"))]\\n    let mon = SimpleMonitor::new(|s| println!(\"{s}\"));\\n    #[cfg(feature = \"tui\")]\\n    let ui = TuiUI::with_version(String::from(\"Baby Fuzzer\"), String::from(\"0.0.1\"), false);\\n    #[cfg(feature = \"tui\")]\\n    let mon = TuiMonitor::new(ui);\\n\\n    // The event manager handle the ...</td>\n",
       "      <td>#[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \"tui\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \"tui\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedba...</td>\n",
       "      <td>doc_2_chunk_4</td>\n",
       "      <td>4</td>\n",
       "      <td>0.032787</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>The chunk contains Rust code that includes the necessary imports and configurations for a fuzzing framework using the libafl library. It sets up the environment for the fuzzer, including the configuration for different operating systems and features, and imports various modules required for corpus management, executors, feedback mechanisms, and mutators, establishing the foundational component...</td>\n",
       "      <td>doc_2</td>\n",
       "      <td>#[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \"tui\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \"tui\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedba...</td>\n",
       "      <td>#[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \"tui\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \"tui\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedba...</td>\n",
       "      <td>doc_2_chunk_0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.032258</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>The chunk contains necessary imports, declarations of static atomic variables, and the definition of an external C function, which are foundational components for managing memory allocation tracking in the context of a memory monitoring system within a Rust program that utilizes the libafl library for fuzzing.\\nuse core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsi...</td>\n",
       "      <td>doc_3</td>\n",
       "      <td>use core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};\\n\\nuse libafl::{\\n    events::EventFirer,\\n    executors::ExitKind,\\n    feedbacks::Feedback,\\n    inputs::UsesInput,\\n    observers::{Observer, ObserversTuple},\\n    state::State,\\n    Error,\\n};\\nuse libafl_bolts::Named;\\nuse libc::SIGABRT;\\nuse serde::{Deserialize, Serialize};\\n\\nextern \"C\" {\\n...</td>\n",
       "      <td>use core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};\\n\\nuse libafl::{\\n    events::EventFirer,\\n    executors::ExitKind,\\n    feedbacks::Feedback,\\n    inputs::UsesInput,\\n    observers::{Observer, ObserversTuple},\\n    state::State,\\n    Error,\\n};\\nuse libafl_bolts::Named;\\nuse libc::SIGABRT;\\nuse serde::{Deserialize, Serialize};\\n\\nextern \"C\" {\\n...</td>\n",
       "      <td>doc_3_chunk_0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.015873</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "                                                                                                                                                                                                                                                                                                                                                                                                              text  \\\n",
       "0  This chunk is part of the main function in a fuzzing application, specifically focusing on the setup of the monitor for displaying fuzzer statistics and the event manager for handling events during the fuzzing loop. It follows the initialization of state, feedback mechanisms, and sets up the fuzzer with a scheduling policy for managing test cases from the corpus.\\n    // The Monitor trait defi...   \n",
       "1  The chunk contains Rust code that includes the necessary imports and configurations for a fuzzing framework using the libafl library. It sets up the environment for the fuzzer, including the configuration for different operating systems and features, and imports various modules required for corpus management, executors, feedback mechanisms, and mutators, establishing the foundational component...   \n",
       "2  The chunk contains necessary imports, declarations of static atomic variables, and the definition of an external C function, which are foundational components for managing memory allocation tracking in the context of a memory monitoring system within a Rust program that utilizes the libafl library for fuzzing.\\nuse core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsi...   \n",
       "\n",
       "  doc_id  \\\n",
       "0  doc_2   \n",
       "1  doc_2   \n",
       "2  doc_3   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                                                                                                                         raw_chunk  \\\n",
       "0      // The Monitor trait define how the fuzzer stats are displayed to the user\\n    #[cfg(not(feature = \"tui\"))]\\n    let mon = SimpleMonitor::new(|s| println!(\"{s}\"));\\n    #[cfg(feature = \"tui\")]\\n    let ui = TuiUI::with_version(String::from(\"Baby Fuzzer\"), String::from(\"0.0.1\"), false);\\n    #[cfg(feature = \"tui\")]\\n    let mon = TuiMonitor::new(ui);\\n\\n    // The event manager handle the ...   \n",
       "1  #[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \"tui\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \"tui\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedba...   \n",
       "2  use core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};\\n\\nuse libafl::{\\n    events::EventFirer,\\n    executors::ExitKind,\\n    feedbacks::Feedback,\\n    inputs::UsesInput,\\n    observers::{Observer, ObserversTuple},\\n    state::State,\\n    Error,\\n};\\nuse libafl_bolts::Named;\\nuse libc::SIGABRT;\\nuse serde::{Deserialize, Serialize};\\n\\nextern \"C\" {\\n...   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                                                                                                                          full_doc  \\\n",
       "0  #[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \"tui\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \"tui\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedba...   \n",
       "1  #[cfg(windows)]\\nuse std::ptr::write_volatile;\\nuse std::{path::PathBuf, ptr::write};\\n\\n#[cfg(feature = \"tui\")]\\nuse libafl::monitors::tui::{ui::TuiUI, TuiMonitor};\\n#[cfg(not(feature = \"tui\"))]\\nuse libafl::monitors::SimpleMonitor;\\nuse libafl::{\\n    corpus::{InMemoryCorpus, OnDiskCorpus},\\n    events::SimpleEventManager,\\n    executors::{inprocess::InProcessExecutor, ExitKind},\\n    feedba...   \n",
       "2  use core::{ffi::c_void, fmt::Debug};\\nuse std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};\\n\\nuse libafl::{\\n    events::EventFirer,\\n    executors::ExitKind,\\n    feedbacks::Feedback,\\n    inputs::UsesInput,\\n    observers::{Observer, ObserversTuple},\\n    state::State,\\n    Error,\\n};\\nuse libafl_bolts::Named;\\nuse libc::SIGABRT;\\nuse serde::{Deserialize, Serialize};\\n\\nextern \"C\" {\\n...   \n",
       "\n",
       "        chunk_id  original_index  _relevance_score  \n",
       "0  doc_2_chunk_4               4          0.032787  \n",
       "1  doc_2_chunk_0               0          0.032258  \n",
       "2  doc_3_chunk_0               0          0.015873  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "context_table.search(QUERY, query_type=\"hybrid\").limit(3).to_pandas().drop(\n",
    "    [\"vector\", \"original_uuid\"], axis=1\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IU2zz1AxOgZp"
   },
   "source": [
    "The results above show the difference between normal retrieval and contextual retrieval, where contextual retrieval combines prompt caching, hybrid search, and the LanceDB reranking API."
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
